Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-15855: core pipeline API #16039

Merged
merged 11 commits into from
Feb 12, 2024
Merged

GH-15855: core pipeline API #16039

merged 11 commits into from
Feb 12, 2024

Conversation

sebhrusen
Copy link
Contributor

implements core API for #15855

@sebhrusen sebhrusen added the core label Jan 29, 2024
@sebhrusen sebhrusen added this to the 3.46.0.1 milestone Jan 29, 2024
@sebhrusen sebhrusen self-assigned this Jan 29, 2024
@sebhrusen sebhrusen linked an issue Jan 29, 2024 that may be closed by this pull request
* @param ignoredFields A {@link Set} of fields to ignore. Can be empty or null.
* @return checksum A 64-bit long representing the checksum of the object
*/
public static <T> long checksum(final T obj, final Set<String> ignoredFields, final long initVal) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic extracted for Model.Parameters so that it can also be used in pipeline DataTransformers

Comment on lines -239 to -242
private ModelBuilderListener _callback;

public void setCallback(ModelBuilderListener callback) {
this._callback = callback;
Copy link
Contributor Author

@sebhrusen sebhrusen Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

callbacks have been moved from Driver to ModelBuilder itself as it allows much better configurability needed by the pipeline

@@ -71,18 +74,23 @@ protected int nModelsInParallel(int folds) {
* This is called before cross-validation is carried out
*/
@Override
public void computeCrossValidation() {
protected void cv_init() {
Copy link
Contributor Author

@sebhrusen sebhrusen Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes in the algos are mainly due to the fact that the CV API now only exposes protected hooks at various places in the model building cycle, otherwise it breaks the pipeline logic that needs a very strict behaviour when building CV models (esp. as it needs full control over the frames being used at that time).
Algos are therefore encouraged to override only those small hooks, and the ModelBuilder itself remains algo-agnostic.

assertNotNull(k.get());
assertVecEquals(fr.vec(i), k.get(), 1e-10);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an example with pipeline say multiply a column by 2 and then build a GLM or GBM or DRF model?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wendycwong added a quick MultiplyNumericColumnTransformer for the test below, with assertions verifying that it's applied correctly and that the original frame is not modified.

@wendycwong
Copy link
Contributor

@sebhrusen : Is it possible for a user to add her own munging pipeline? For example, I want to transform a column by subtracting the column mean?

@sebhrusen
Copy link
Contributor Author

sebhrusen commented Feb 6, 2024

Is it possible for a user to add her own munging pipeline? For example, I want to transform a column by subtracting the column mean?

@wendycwong no, this is not currently supported:

  1. the current implementation focuses on the backend implementation: the client support is only here to manipulate—esp. be able to predict—pipeline models that have been built by AutoML for example.
  2. there are many ways to extend the pipeline logic and make it more customizable:
    1. in AutoML, the preprocessing param can used to select various predefined transformers that will make up the training pipeline.
    2. for single models and grids, the pipeline client can later be extended to allow the user to define its own pipeline (for example using a syntax similar to sklearn Pipeline).
    3. for even better customization, like the scenario you're suggesting, we could allow code ingestion—probably jython scripts, it should not be too difficult to implement a JythonDataTransformer.
  3. finally, let's keep in mind that Mojo support is also not supported yet, although likely to be much easier to support with this Pipeline mechanism than with the legacy Target encoding support embedded in Model/ModelBuilder for example, as every transformation now applies clearly sequentially and the estimator model (e.g. GLM) contains only post-transformation information, whereas with the legacy TE integration the model contained a mix of pre-encoding and post-encoding making the MOJO extremely difficult to implement due to other categorical encoding being mixed to this.

I'm answering more than your simple question here, but wanted to explain the scope of this, so maybe I will copy this answer to the original issue for reference.

tomasfryda
tomasfryda previously approved these changes Feb 9, 2024
Copy link
Contributor

@tomasfryda tomasfryda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just some minor comments, otherwise it looks great! Thank you @sebhrusen for not just hacking it together (as I probably would 😅 ) and inventing those nice abstractions!

model$estimator_model <- NULL
}
model$transformers <- unlist(lapply(model$transformers, function(dt) new("H2ODataTransformer", id=dt$id, description=dt$description)))
# class(model) <- "H2OPipeline"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this commented? If needed, you can assign multiple classes, e.g., class(model) <- c("H2OPipeline", "H2OModel").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! right I forgot about the multiple inheritance. I think I commented it out because it didn't seem necessary (single algo don't have dedicated class) and it broke some behaviour somewhere (can't remember what exactly).
The funny part is that the class is defined as follow:

setClass("H2OPipeline", contains="H2OModel",

but afair, it was still not recognized as a model somewhere…
I will give a try to your suggestion, and see if it breaks some R tests.

h2o-bindings/bin/custom/R/gen_pipeline.py Outdated Show resolved Hide resolved

private enum Command {
SUBSTITUTE() {
private final Pattern CMD = Pattern.compile("s/(.*?)/(.*?)/?");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For other reviewers:
It took me couple minutes before I realized this is not a PCRE to substitute a non-greedy match with verbatim (.*?). The whole expression is actually the part for matching (not substitution). The expected string is s/something/something or s/something/something/.

Please correct me if I'm wrong @sebhrusen .

Copy link
Contributor Author

@sebhrusen sebhrusen Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct, the idea behind those KeyGen is to better understand how all those keys are created, and ensure consistency when some keys are created based on other ones. Pipeline can use a lot of keys for all temp frames, for models used in data transformers and so on.

and example of this substitution pattern is used in the pipeline integration in AutoML:

     // in ModelingStep.applyPipeline(...)
      pparams._estimatorKeyGen = hyperParams == null 
              ? new ConstantKeyGen(resultKey) 
              : new PatternKeyGen("{0}|s/"+PIPELINE_KEY_PREFIX+"//")  // in case of grid, remove the Pipeline prefix to obtain the estimator key, this allows naming compatibility with the classic mode.
              ;

this way the key formatting doesn't have to leak later in the code (in this case in Grid) that should be pipeline-agnostic

@@ -0,0 +1 @@
#hex.pipeline.PipelineRegistration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional # prefix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, right, I should remove this

import water.fvec.Frame;

/**
* WiP: not used for now.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you specify what remains to be done here? (if anything)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's mainly a high level abstraction for other transformers that would for example remove rows (not) containing a specific value or that have too few values…
As we don't use this in AutoML (yet), I didn't implement any specific transformer, and so this abstraction is probably missing useful stuff…
Basically, it's just an idea for now. I can delete it if you want.

h2o-r/h2o-package/R/kvstore.R Show resolved Hide resolved
h2o-r/h2o-package/R/classes.R Outdated Show resolved Hide resolved
h2o-core/src/main/java/hex/pipeline/PipelineHelper.java Outdated Show resolved Hide resolved
h2o-core/src/main/java/hex/pipeline/PipelineHelper.java Outdated Show resolved Hide resolved
@sebhrusen sebhrusen merged commit c15ea1e into master Feb 12, 2024
62 of 68 checks passed
@sebhrusen sebhrusen deleted the seb/gh-15855 branch February 12, 2024 13:02
mn-mikke added a commit that referenced this pull request Feb 27, 2024
valenad1 added a commit that referenced this pull request Mar 8, 2024
valenad1 added a commit that referenced this pull request Mar 11, 2024
* Revert "GH-15857: cleanup legacy TE integration in ModelBuilder and AutoML (#16061)"

This reverts commit a8f309b.

* Revert "GH-15857: AutoML pipeline support (#16041)"

This reverts commit 17fa9ee.

* Revert "GH-15856: Grid pipeline support (#16040)"

This reverts commit b7ac670.

* Revert "GH-15855: core pipeline API (#16039)"

This reverts commit c15ea1e.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AutoML Pipeline – Java API
3 participants